Anomaly Sequences Detection from Logs Based on Compression
Mining information from logs is an old and still active research topic. In recent years, with
the rapid emergence of cloud computing, log mining has become increasingly important to industry.
This paper focuses on one major task of log mining, anomaly detection, and proposes a novel
method for mining abnormal sequences from large logs. Unlike previous anomaly detection
systems, which are based on statistics, probabilities and the Markov assumption, our approach
measures the strangeness of a sequence using compression. It first trains a grammar of normal
behavior using grammar-based compression, then measures the information quantity and density of
questionable sequences according to the increase in grammar length. We have applied our approach
to mining real bugs from fine-grained execution logs. We have also tested its ability to detect
intrusions using publicly available system call traces. The experiments show that our method
successfully selects the strange sequences that are related to bugs or attacks.




I. INTRODUCTION 

From trend analysis to system tuning, log mining techniques
are widely used in both commercial and research settings.
In recent years, diagnosing systems from their logs has become
a hot research topic because of the rapid emergence of
cloud computing systems. Problems in such systems are often
non-deterministic because they are caused by uncontrollable
conditions. Therefore, developers can no longer watch the
execution paths or the communications between different
components as they used to. The only clues available are the
text logs generated by the buggy systems. However, the huge
size of such logs makes them hard for developers to deal with.

Many approaches have been proposed for non-deterministic
problems. Record-replay is a promising one. Record-replay
systems record low-level execution details at run time. When
debugging, they replay the buggy execution from those data,
letting developers check the control flow and data flow of
the target programs. Nevertheless, although recent
record-replay tools achieve a performance impact low enough
for deployment in production environments, replaying
long-lasting 7x24 logs and manually identifying the key parts
of the execution flow is still an obstacle.

The heart of the above problems is finding unusual patterns
in a large data set. This is also the goal of intrusion
detection. There are two general approaches to intrusion
detection: misuse intrusion detection (MID) and anomaly
intrusion detection (AID). MID models unusual behaviors as
specific patterns and identifies them in logs. However, MID
systems cannot detect unknown abnormal behaviors. This paper
focuses on AID, which models normal behaviors and reports
unacceptable deviations. Anomaly detection has been studied
for decades, and the related techniques are used to detect
network intrusions and attacks. Those approaches apply
probability models and machine learning algorithms, and most
of them rely on the Markov assumption. They have achieved
positive results on some specific types of logs such as
system call traces. However, although the Markov assumption
makes them sensitive to unusual state transitions at a low
level, they fall short in identifying high-level misbehavior.

This paper proposes a novel anomaly detection method.
Unlike the above approaches, our method does not rely on
statistics, probabilities or the Markov assumption, and does
not need the complex algorithms used in machine learning.
The principle of our approach is straightforward: use
compression to measure the information quantity of sequences.
Our method can be used to find some high-level abnormal
behavior. To the best of our knowledge, our work is the first
attempt to exploit the relationship between information
quantity and compression in mining unusual sequences from logs.

We first introduce the principle of our approach using
a simple example in Section II, then present the detailed
algorithms in Section III. Section IV lists a set of
experiments that show the ability of our method in bug finding
and intrusion detection. Section V concludes the paper.



II. OVERVIEW 

Our approach is inspired by the following observation.

When the normal behavior of an information source is
known to the viewer, she can describe another normal
sequence using only a few words, but needs more words to
describe an abnormal sequence. For example, the viewer is
told that "1234 1235 1234 1235" are 4 normal sequences.
For a questionable sequence "1235", she can describe it as
"another type II sequence". The only information contained
in her description is "type II". However, for sequence
"1237", the most elegant description is "replace the fourth
character of the normal pattern by 7". The information
contained in this description is "fourth" and "7". For
sequence "32145", she has to say "a new sequence, the first
character is 3, the second character is 2, ...". The
information contained in this description is much more than
in the previous two. Therefore, the viewer can infer that the
last sequence is "the most strange one". An anomaly detection
system should report the last sequence among the three.

TABLE I: The size of gzipped data

In computer science there is a method to "describe a
sequence": compression. A compression algorithm reduces the
size of a sequence. For a long sequence, the compressed data
can be thought of as "the description of the original
sequence". It is well known that no compression algorithm can
reduce the size of every sequence to zero. It is also well
known that a compressed file is hard to compress again: the
entropy rate $H(X)$ of the source data bounds the performance
of a compression algorithm.

Our approach exploits the relationship between the
information quantity and the compressed data size. To
select the "most strange" sequence, we use the following
3 steps:

• Training: compress a set of normal sequences; the
compressed data size is $Q_0$.

• Evaluating: for each candidate sequence $n$, add it
to the normal set used in the training step and compress
the new set. The compressed size is $Q_n$. Let
$I_n = Q_n - Q_0$.

• Selecting: the sequence $n$ that generates the
largest $I_n$ is selected.

We use the previous example to demonstrate the above
3 steps. The compression algorithm is gzip.

The first row of Table I shows the result of training.
The sequence 1234123512341235 is the concatenation of the 4
normal sequences; gzip compresses it into 30 bytes. The
following rows show the evaluating step: each of the 3
questionable sequences is appended and the result is gzipped.
The increments for the three are 0, 1 and 5 bytes. The third
step selects 32145 as "the most strange" one because it
generates the largest $I_n$.
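
The three steps can be tried out with a few lines of Python. The sketch below is only an illustration under stated assumptions: it uses the standard `zlib` module (the DEFLATE compressor that gzip is built on) instead of the gzip tool, the function names are ours, and the toy data are made up, so the byte counts will not match Table I.

```python
import random
import zlib


def compressed_size(data):
    """Size in bytes of `data` after DEFLATE compression at level 9."""
    return len(zlib.compress(data, 9))


def increments(normal, candidates):
    """I_n = Q_n - Q_0 for each candidate appended to the normal data."""
    q0 = compressed_size(normal)                    # training
    return [compressed_size(normal + c) - q0        # evaluating
            for c in candidates]


def most_strange(normal, candidates):
    """Index of the candidate with the largest increment (selecting)."""
    scores = increments(normal, candidates)
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy data (an assumption, not from the paper): a repetitive "normal"
# source, one more chunk of the same pattern, and one chunk of
# pseudo-random bytes that shares no structure with the normal data.
rng = random.Random(0)
normal = b"1234123512341235" * 50
familiar = b"1234123512341235" * 10
unfamiliar = bytes(rng.randrange(256) for _ in range(160))

print(increments(normal, [familiar, unfamiliar]))
print(most_strange(normal, [familiar, unfamiliar]))
```

The familiar chunk adds only a few bytes to the compressed output, while the unfamiliar chunk adds roughly its own length, so it is the one selected.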

III. DETAIL 

In this section we introduce our approach in detail. 

A. Grammar-based compression algorithm 

Although we used gzip to explain the principle of our
anomaly detection method in Section II, gzip and other
well-known generic compression algorithms are not suitable
for digging anomalous sequences out of execution logs, for
the following reasons:

• Nearly all well-known compression algorithms (such
as gzip, bzip2 and rar) only consult a bounded amount
of recent data. gzip, for example, is based on LZ77,
which uses a sliding window to store recent data for
matching against the incoming stream and ignores older
data. The sliding window is important for the
compression speed of a generic algorithm. However, it
discards the historic knowledge about the data source,
so older data have no effect when new data are
compressed.

• Generic compression algorithms compress data as a
byte stream. Their alphabet is the 256 possible bytes
from 0x00 to 0xff. However, the unit of an execution
log is the log entry. A compression algorithm whose
alphabet is made of the possible entries can discover
more meaningful patterns.

• Generic compression algorithms are unable to tell
different sequences apart during training and
evaluating. Sequences in execution logs have to be
concatenated one after another. Some patterns are then
created unintentionally across sequence boundaries and
affect the evaluation. In the previous example, generic
compression algorithms take the pattern "34123" into
account although it is not part of any sequence.
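
The last two points can be made concrete with a short sketch. It is not taken from the paper: `encode_entries` is a hypothetical helper that builds an alphabet from whole log entries, the `server.c:NNN` strings are invented entries, and the final lines check the "34123" pattern from the running example.

```python
def encode_entries(sequences):
    """Map each distinct log entry to a small integer symbol.

    `sequences` is a list of sequences, each a list of log-entry strings.
    Returns the encoded sequences and the entry -> symbol table.
    """
    table = {}
    encoded = []
    for seq in sequences:
        # A new entry receives the next unused integer as its symbol.
        encoded.append([table.setdefault(entry, len(table)) for entry in seq])
    return encoded, table


# Hypothetical log entries, for illustration only.
logs = [["server.c:516", "server.c:536"],
        ["server.c:516", "server.c:170"]]
encoded, table = encode_entries(logs)
print(encoded)   # [[0, 1], [0, 2]]

# A pattern that exists only across a sequence boundary.
normal = ["1234", "1235", "1234", "1235"]
pattern = "34123"
in_concatenation = pattern in "".join(normal)
in_any_sequence = any(pattern in s for s in normal)
print(in_concatenation, in_any_sequence)   # True False
```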

Our approach chooses a grammar-based code as
the underlying compression algorithm. Grammar-based
compression algorithms have been developed in recent decades
as a new way to compress data losslessly. Kieffer et al.
first published some important theorems on them. This
paper uses the same symbols and terminology to describe
the algorithm. Yang et al. presented a greedy grammar
transform, and our algorithm is based on it. This grammar
transform is similar to SEQUITUR, but generates
more compact grammars.

The idea of grammar-based compression is simple: a
stream $x$ can be represented as a context-free grammar
$G_x$ that generates the language $\{x\}$ and takes much
less space to store. Grammar-based compression is suitable
for compressing execution logs because such logs are
generated hierarchically by programs and are highly
structured.



1. Grammar transform overview 

Grammar transform converts a sequence $x$ into an
admissible grammar $G_x$ that represents $x$. An admissible
grammar $G_x$ is a grammar that guarantees
$L(G_x) = \{x\}$ (the language of $G_x$ contains only
$x$). We define $G = (V, T, P, S)$ in which

• $V$ is a finite nonempty set of non-terminals.

• $T$ is a finite nonempty set of terminals.

• $P$ is a finite set of production rules. A production
rule is an expression of the form $A \to \alpha$, where
$A \in V$, $\alpha \in (T \cup V)^+$ ($\alpha$ is nonempty).

• $S \in V$ is the start symbol.

Define $f_G$ to be the endomorphism on $(V(G) \cup T(G))^*$ such
that:

• $f_G(a) = a$, $a \in T(G)$

• $f_G(A) = \alpha$, $A \in V(G)$ and $A \to \alpha \in P(G)$

• $f_G(\epsilon) = \epsilon$

• $f_G(u_1 u_2) = f_G(u_1) f_G(u_2)$

Define a family of endomorphisms $\{f_G^k : k = 0, 1, 2, \ldots\}$:

• $f_G^0(x) = x$

• $f_G^1(x) = f_G(x)$

• $f_G^k(x) = f_G(f_G^{k-1}(x))$, $k \ge 2$

Kieffer et al. showed that, for an admissible grammar
$G_x$, $f_{G_x}^{|V(G_x)|}(u) \in (T(G_x))^+$ for each
$u \in (V(G_x) \cup T(G_x))^+$, and
$f_{G_x}^{|V(G_x)|}(S(G_x)) = x$. Informally speaking,
for an admissible grammar $G_x$, by iteratively replacing
non-terminals with the right side of the corresponding
production rules, every $u \in (V(G) \cup T(G))^+$ will finally be
translated into a string which contains only terminals.
Define a mapping $f_G^\infty$ such that $f_G^\infty(u) = f_G^{|V(G)|}(u)$ for
each $u \in (V(G) \cup T(G))^+$. Informally speaking, $f_G^\infty(u)$
is the original sequence represented by $u$.
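
As an illustration of this mapping, the sketch below expands a symbol into the terminal string it represents. The dictionary representation of a grammar and the name `expand` are our own assumptions, not the paper's implementation; the example grammar is the final grammar of the 1234 / 1235 / 1237 example worked out in Table II.

```python
def expand(symbol, rules):
    """Return the list of terminals represented by `symbol`.

    `rules` maps each non-terminal to the list of symbols on the right
    side of its production rule; any symbol that is not a key of `rules`
    is treated as a terminal. The grammar is assumed to be admissible,
    so the recursion terminates.
    """
    if symbol not in rules:
        return [symbol]
    out = []
    for s in rules[symbol]:
        out.extend(expand(s, rules))
    return out


rules = {
    "p0": ["p1", "p2", "p4"],
    "pb": ["1", "2", "3"],
    "p1": ["pb", "4"],
    "p2": ["pb", "5"],
    "p4": ["pb", "7"],
}
print("".join(expand("p1", rules)))   # 1234
print("".join(expand("p0", rules)))   # 123412351237
```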

2. The greedy grammar transform algorithm 

The algorithm we use is based on the following reduction
rules (in the description below, $\alpha$ and $\beta$ denote
strings in $(V(G) \cup T(G))^*$):

1. For an admissible grammar $G$, if there is a
non-terminal $A$ that appears on the right side of the
production rules in $P(G)$ only once, let $A \to \alpha$ be
the production rule corresponding to $A$ and let
$B \to \beta_1 A \beta_2$ be the only rule that contains $A$
on its right side. Remove $A$ from $V(G)$ and
$A \to \alpha$ from $P(G)$, then replace the production
rule of $B$ by $B \to \beta_1 \alpha \beta_2$.

2. For an admissible grammar $G$, if there is a
production rule $A \to \alpha_1 \beta \alpha_2 \beta \alpha_3$
where $|\beta| > 1$, add a new non-terminal $B$ to $V(G)$,
create a new rule $B \to \beta$, and replace the production
rule of $A$ by $A \to \alpha_1 B \alpha_2 B \alpha_3$.

3. For an admissible grammar $G$, if there are two
production rules $A_1 \to \alpha_1 \beta \alpha_2$ and
$A_2 \to \alpha_3 \beta \alpha_4$, in which $|\beta| > 1$,
either $|\alpha_1| > 0$ or $|\alpha_2| > 0$, and either
$|\alpha_3| > 0$ or $|\alpha_4| > 0$, add a new non-terminal
$B$ to $V(G)$, create a new rule $B \to \beta$, replace the
production rule of $A_1$ by $A_1 \to \alpha_1 B \alpha_2$,
and replace the production rule of $A_2$ by
$A_2 \to \alpha_3 B \alpha_4$.

# Transform x into an admissible grammar.
# Returns the start rule p0; all other rules are returned in G.
def GrammarTransform(x):
    G = {}
    p0 = SeqTransform(x, G)
    return p0, G

# x is the sequence to be transformed
# G is a set of production rules
# output: returns the start rule p_x so that f_G^∞(p_x) = x;
#         all other rules are added into G
def SeqTransform(x, G):
    p_x = (s_x -> ε)
    while |x| > 0:
        # greedy read ahead and match
        for p in G:
            v = left side of p
            check whether f_G^∞(v) is a prefix of x
        if matched:
            v = the matched non-terminal with the longest f_G^∞(v)
            append v to the right side of p_x
            pop |f_G^∞(v)| entries from x
        else:
            pop one entry t from x
            append t as a terminal to the right side of p_x
        apply reduction rules 1-3 iteratively over G ∪ {p_x}
        until none of them can be applied;
        newly created rules are added into G
    return p_x

FIG. 1: Greedy grammar transform algorithm
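
To make reduction rule 1 concrete, here is a minimal sketch. It assumes a grammar stored as a Python dictionary from non-terminal names to lists of symbols; the function name and the `protected` parameter are ours and only illustrate the rule, they are not the authors' implementation. The example reproduces the "apply rule 1 on pa" step of Table II.

```python
def apply_rule_1(rules, protected=()):
    """Inline every non-terminal that is used exactly once (rule 1).

    `rules` maps a non-terminal to the list of symbols on its right side
    and is modified in place. Names in `protected` are never removed.
    """
    changed = True
    while changed:
        changed = False
        # Count how often each non-terminal occurs on any right side.
        uses = {name: 0 for name in rules}
        for body in rules.values():
            for s in body:
                if s in uses:
                    uses[s] += 1
        for a, count in uses.items():
            if count == 1 and a not in protected:
                alpha = rules.pop(a)
                # Find the single rule B that mentions A and splice alpha in.
                for b, body in rules.items():
                    if a in body:
                        i = body.index(a)
                        rules[b] = body[:i] + alpha + body[i + 1:]
                        break
                changed = True
                break   # recount after every change
    return rules


rules = {"p1": ["pb", "4"], "pa": ["1", "2"],
         "pb": ["pa", "3"], "p2": ["pb"]}
print(apply_rule_1(rules))
# {'p1': ['pb', '4'], 'pb': ['1', '2', '3'], 'p2': ['pb']}
```

Non-terminals that are never referenced inside this set (here p1 and p2, which are referenced only from the start rule) are left alone, because the rule fires only on exactly one use.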

Figure 1 illustrates the grammar transform
algorithm. It is very similar to SEQUITUR
except for the greedy read-ahead step, which guarantees
that in the generated grammar $G$, $f_G^\infty(v)$ is
different for different $v \in V(G)$.
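
The read-ahead step matches the head of the remaining input against the expansions of all existing non-terminals and keeps the longest match; Section IV.C notes that this matching can be optimized using a prefix tree. The sketch below shows one possible way to organize such a tree. The class and method names are hypothetical, and the stored expansions are those of pb, p1 and p2 from the example in Table II.

```python
class PrefixTree:
    """Prefix tree over expansions; a node may carry a non-terminal name."""

    def __init__(self):
        self.children = {}
        self.name = None   # non-terminal whose expansion ends at this node

    def insert(self, expansion, name):
        node = self
        for entry in expansion:
            node = node.children.setdefault(entry, PrefixTree())
        node.name = name

    def longest_match(self, x):
        """Return (name, length) of the longest stored expansion that is
        a prefix of x, or (None, 0) when nothing matches."""
        node, best = self, (None, 0)
        for depth, entry in enumerate(x, start=1):
            node = node.children.get(entry)
            if node is None:
                break
            if node.name is not None:
                best = (node.name, depth)
        return best


tree = PrefixTree()
tree.insert("123", "pb")
tree.insert("1234", "p1")
tree.insert("1235", "p2")
print(tree.longest_match("1234"))   # ('p1', 4)
print(tree.longest_match("1237"))   # ('pb', 3)
```

One walk down the tree costs at most the length of the longest stored expansion, instead of one comparison per non-terminal.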

The algorithm in Figure 1 transforms a sequence into a
context-free grammar. To prevent patterns that span
different sequences from interfering with the processing, we
wrap the algorithm as shown in Figure 2. The wrapped
algorithm guarantees that every execution sequence is
represented by a non-terminal on the right side of $p_0$.
The reduction rules never consider patterns across sequences
because $p_0$ is not in $G$. Figure 2 also shows that our
algorithm eliminates redundant sequences by dropping those
results which contain only one symbol.

In Table II we explain the above algorithm using an
example that computes a grammar for the 4 sequences 1234
1235 1234 1237. The final grammar is listed in the last
row of the table.

We measure the information quantity of a sequence by
counting the number of additional symbols that it
introduces into the grammar. Figure 3 describes the
evaluation process. After a grammar has been generated,
EvaluateSequence is used to compute the information
quantity $I$ and the information density $D$ (the average
number of symbols produced per entry) of a sequence $x$. To
illustrate the evaluation process, we evaluate sequences 2238
and 1239 (Table III) using the $G$ generated in Table II.

# logs is a set of sequences
def LogTransform(logs):
    G = {}
    p0 = (s_0 -> ε)
    for seq in logs:
        p_n = SeqTransform(seq, G)
        if p_n contains only one symbol on its right side:
            drop p_n
            continue
        insert p_n into G
        append p_n to the right side of p0
    return (p0, G)

FIG. 2: Transform a set of sequences

# Count the total number of symbols needed to describe
# all rules in `rules` and every rule they refer to
def EvaluateRules(rules, G):
    v = 0
    processed_rules = {}
    pending = list of the rules in `rules`
    for r in pending:        # pending may grow during the loop
        if r is in processed_rules:
            continue
        processed_rules.insert(r)
        # r is a production rule of the form A -> α
        v += |α|
        for symbol in α:
            if symbol is a non-terminal:
                r_n = G[symbol]
                if r_n not in processed_rules:
                    pending.append(r_n)
    return v

# x is the sequence to be evaluated
# p0 and G are the parameters of an already computed grammar
def EvaluateSequence(x, p0, G):
    G' = G   # deep copy
    p_n = SeqTransform(x, G')
    info_old = EvaluateRules({p0}, G)
    info_new = EvaluateRules({p0, p_n}, G')
    I = info_new - info_old
    D = I / |x|
    return I, D

FIG. 3: Evaluation of a new sequence

TABLE II: Example of 4 sequences: 1234 1235 1234 1237

processed   p_n           G                                                   p0
string
begin processing sequence 1234
            p1 -> ε       {}                                                  p0 -> ε
1234        p1 -> 1234
begin processing a new sequence 1235
            p2 -> ε       {p1}, p1 -> 1234                                    p0 -> p1
12          p2 -> 12
apply rule 3 on pattern 12
            p2 -> pa      {p1, pa}, p1 -> pa 3 4, pa -> 12
123         p2 -> pa 3
apply rule 3 on pattern pa 3
            p2 -> pb      {p1, pa, pb}, p1 -> pb 4, pa -> 12, pb -> pa 3
apply rule 1 on pa
                          {p1, pb}, p1 -> pb 4, pb -> 123
1235        p2 -> pb 5
begin processing a new sequence 1234
            p3 -> ε       {p1, pb, p2}, p1 -> pb 4, pb -> 123, p2 -> pb 5     p0 -> p1 p2
look ahead greedy match: p1 and pb matched, |f_G^∞(p1)| is the longest
1234        p3 -> p1
begin processing a new sequence 1237
p3 contains only 1 symbol, eliminate p3
            p4 -> ε                                                           p0 -> p1 p2
look ahead greedy match: pb matched
123         p4 -> pb
1237        p4 -> pb 7
finish processing
                          {p1, pb, p2, p4}, p1 -> pb 4, pb -> 123,            p0 -> p1 p2 p4
                          p2 -> pb 5, p4 -> pb 7

TABLE III: Evaluation of 2 sequences

G                   2238                  1239
p0 -> p1 p2 p4      p2238 -> 2 pc 8       p1239 -> pb 9
pb -> 123           p0 -> p1 p2 p4        p0 -> p1 p2 p4
p1 -> pb 4          pb -> 1 pc            pb -> 123
p2 -> pb 5          pc -> 23              p1 -> pb 4
p4 -> pb 7          p1 -> pb 4            p2 -> pb 5
                    p2 -> pb 5            p4 -> pb 7
                    p4 -> pb 7
12 symbols          16 symbols            14 symbols
                    I = 4                 I = 2
                    D = 1                 D = 0.5
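
The symbol counts reported in Table III can be checked with a small script. This is a sketch under our own assumptions: grammars are stored as dictionaries from non-terminal names to right-side symbol lists, the helper names are ours, and the grammars after adding each sequence are copied by hand from Table III rather than computed by the grammar transform.

```python
def count_symbols(start_rules, rules):
    """Total right-side length of all rules reachable from `start_rules`.

    `rules` maps each non-terminal to the list of symbols on its right
    side; symbols that are not keys of `rules` are terminals.
    """
    total = 0
    seen = set()
    pending = list(start_rules)
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        body = rules[name]
        total += len(body)
        pending.extend(s for s in body if s in rules and s not in seen)
    return total


def information(start, new_rule, old_rules, new_rules, length):
    """I and D of a sequence, given the grammar before and after adding it."""
    info_old = count_symbols([start], old_rules)
    info_new = count_symbols([start, new_rule], new_rules)
    i = info_new - info_old
    return i, i / length


# Grammars copied from Table III.
G = {"p0": ["p1", "p2", "p4"], "pb": ["1", "2", "3"],
     "p1": ["pb", "4"], "p2": ["pb", "5"], "p4": ["pb", "7"]}
G_2238 = {"p0": ["p1", "p2", "p4"], "pb": ["1", "pc"], "pc": ["2", "3"],
          "p1": ["pb", "4"], "p2": ["pb", "5"], "p4": ["pb", "7"],
          "p2238": ["2", "pc", "8"]}
G_1239 = dict(G, p1239=["pb", "9"])

print(count_symbols(["p0"], G))                    # 12
print(information("p0", "p2238", G, G_2238, 4))    # (4, 1.0)
print(information("p0", "p1239", G, G_1239, 4))    # (2, 0.5)
```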



B. Anomaly detection based on compression 

We introduce our anomaly detection algorithm in this
subsection.

The goal of the algorithm is to find abnormal sequences
in given logs. The input consists of two data sets: one
contains normal sequences, the other contains questionable
sequences. Our algorithm reports the abnormal sequences in
the latter set.

The algorithm can be divided into the following steps:

1. Training: transform the normal set $S_n$ into an
admissible grammar $G$ with $S(G) = p_0$.

2. Evaluating: for each sequence $t_n$ in the
questionable set $S_q$, compute $(I_{t_n}, D_{t_n})$ using
EvaluateSequence($t_n$, $p_0$, $G$).

3. Reporting: report the $m_1$ sequences that generate
the largest $I_{t_n}$, and the $m_2$ sequences that
generate the largest $D_{t_n}$. $m_1$ and $m_2$ are
configurable.

For example, in Table III sequence 2238 is stranger than 1239.

main:/../../server.c:516
main:/../../server.c:536
main:/../../server.c:538
server_init:/../../server.c:170
server_init:/../../server.c:172

FIG. 4: Sample ReBranch trace

We report abnormal sequences according to both $I$ and
$D$ because we believe both are meaningful. A sequence
$x$ that generates a large $I_x$ indicates that there
is no similar sequence in $S_n$. However, if $x$ is a very
long sequence, the symbols used to describe $x$ may be at
a very high level. Compared with such an $x$, a short
sequence $y$ with a comparable $I_y$ is more valuable.
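
The reporting step amounts to two independent top-m selections. A minimal sketch, assuming the (I, D) pairs have already been computed and are stored in a dictionary keyed by sequence id (the function name `report` is ours):

```python
import heapq


def report(scores, m1, m2):
    """Pick the sequences to report.

    `scores` maps a sequence id to its (I, D) pair. Returns the ids of
    the m1 sequences with the largest I and of the m2 sequences with the
    largest D.
    """
    by_i = heapq.nlargest(m1, scores, key=lambda s: scores[s][0])
    by_d = heapq.nlargest(m2, scores, key=lambda s: scores[s][1])
    return by_i, by_d


# (I, D) values of the two sequences evaluated in Table III.
scores = {"2238": (4, 1.0), "1239": (2, 0.5)}
print(report(scores, 1, 1))   # (['2238'], ['2238'])
```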

IV. EXPERIMENTAL ANALYSIS 
A. Fine grained execution log 

We tested the ability of our method to find bugs
in fine-grained execution logs. The data sets we used were
generated using ReBranch. ReBranch is a record-replay
tool for debugging. It records the outcome of all branch
instructions at run time, and replays the execution
according to these logs for debugging. We converted the
traces into line number sequences. Figure 4 shows a piece
of a sample trace.

We tried our algorithm on finding two non-deterministic
bugs, one in lighttpd (a lightweight web server) and one
in memcached (a key-value object caching system).

In lighttpd bug 2217, a few CGI requests sometimes
time out. The bug is caused by a race condition:
when a child process exits before the parent process is
notified about the state of the corresponding pipe, the
parent wrongly removes the pipe from the event pool
and never closes the connection because it assumes the
pipe still contains data.

In our experiment on the lighttpd bug, we first collected
a trace of 500 correct requests for training, then
collected another trace of 1000 requests for testing. 2 of
these 1000 requests timed out. Traces are pre-processed
to divide them into sequences. During pre-processing,
signal handling is removed. A sequence begins at the
entry of connection_state_machine and ends at the exit
point of that function. After pre-processing, the normal
trace contains 2501 sequences made of 3337395 entries,
and the questionable trace contains 4996 sequences made of
6661425 entries.

memcached bug 106 is a combination of 2 bugs. We
first fixed a UDP deadlock problem with the help of
ReBranch. After that, when the cache server receives a
magic UDP packet, some of the following UDP requests do not
get a reply. The problem is caused by an incorrect state
transfer: memcached uses a state machine when serving a
request. The incorrect state transfer is conn_read ->
conn_closing. The correct transfer sequence is more
complex.

TABLE IV: Test result of ReBranch data sets

lighttpd
train result: 3337395 entries into 2793 symbols
top most 5 I          top most 5 D
I_654 = 41            D_654 = 0.039653
I_3990 = 41           D_3990 = 0.039653
I_1172 = 3            D_655 = 0.019231
I_? = 1               D_3991 = 0.019231
I_? = 1               D_22 = 0.018868

memcached
train result: 862636 entries into 582 symbols
top most 5 I          top most 5 D
I_1237 = 27           D_1237 = 0.031765
I_1608 = 27           D_1608 = 0.031765
I_1609 = 27           D_1609 = 0.031765
I_? = 1               D_? = 0.009091
I_? = 1               D_? = 0.009091

In the memcached experiment, we first collected a trace
of 1000 correct UDP requests, then tried to identify 3
buggy requests out of 1003 new requests. As in the previous
experiment, we split the traces into sequences. A sequence
begins at the entry point of event_handler() and ends at
the exit point of that function. After splitting, the normal
trace data set contains 1442 sequences made of 862636
entries; the questionable trace contains 1609 sequences made
of 883226 entries.

The results of the above 2 experiments are listed in
Table IV. In the lighttpd experiment, our algorithm finds 2
sequences (654 and 3990) whose $I$ and $D$ are much larger
than the others. In the memcached experiment, our algorithm
finds 3 strange sequences (1237, 1608 and 1609). We
confirmed by manual replay that those sequences are indeed
the buggy ones.

It is hard to detect the memcached 106 bug using
traditional Markov-based intrusion detection methods
because the misbehavior is at a very high level.
Markov-based methods only consider the probability of a
transfer from one entry to another. However, in this example,
a state transfer operation is implemented by many lines,
and each line-to-line transfer is valid. If the developer
knows the distance between the key lines that represent a
state transfer, a higher-order Markov model or an n-gram
model can be used. Nevertheless, the developer has to adjust
the length of the sliding window manually for each program.
Furthermore, computing a higher-order model requires much
more resources, which usually grow exponentially with the
order.
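
The blind spot of a first-order model can be shown with a toy check. The sketch below is only an illustration: the letters A to F are hypothetical log entries, not data from the memcached trace, and the function names are ours. Every adjacent transition of the candidate has been seen in some normal sequence, yet the candidate as a whole is new.

```python
def bigrams(seq):
    """Set of adjacent (previous, next) pairs in a sequence."""
    return set(zip(seq, seq[1:]))


def unseen_transitions(normal, candidate):
    """Transitions of `candidate` that never occur in any normal sequence."""
    known = set()
    for seq in normal:
        known |= bigrams(seq)
    return bigrams(candidate) - known


normal = ["ABCD", "EBCF"]
candidate = "ABCF"          # never observed as a whole

print(unseen_transitions(normal, candidate))   # set()
print(candidate in normal)                     # False
```

A model that looks at 4 consecutive entries would catch this candidate, but the number of possible windows over an alphabet of size k grows as k to the power of the window length, which is the cost mentioned above.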
data set            traces   procs   entries
xlock-synth-unm     71       71      339177
xlock-intrusions    2        2       949
named-live          1        27      9230572
named-exploit       2        5       1800

229 2
229 1
370 66
370 5
370 63

FIG. 5: UNM data set sample and size
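
Each sample entry in Figure 5 pairs a process id with a system call number. Splitting such a trace into one system call sequence per process, as the system call experiment requires, could look like the following sketch (the function name and the handling of malformed lines are our assumptions):

```python
from collections import defaultdict


def split_by_process(trace_lines):
    """Group system call numbers by process id, keeping their order.

    Each line holds two integers: process id, then system call number.
    """
    sequences = defaultdict(list)
    for line in trace_lines:
        fields = line.split()
        if len(fields) != 2:
            continue                  # skip blank or malformed lines
        pid, syscall = map(int, fields)
        sequences[pid].append(syscall)
    return dict(sequences)


sample = ["229 2", "229 1", "370 66", "370 5", "370 63"]   # from Figure 5
print(split_by_process(sample))   # {229: [2, 1], 370: [66, 5, 63]}
```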



TABLE V: Test result of UNM data sets

xlock train result: 266563 entries into 6149 symbols
top most 5 normal sequences            2 exploited sequences
I_? = 776     D_10 = 0.107226          I_? = 183     D_? = 0.372709
I_8 = 89      D_? = 0.049743           I_? = 176     D_? = 0.380952
I_? = 87      D_? = 0.036998
I_? = 59      D_? = 0.032590
I_? = 46      D_4 = 0.026375

named train result: 9215497 entries into 66148 symbols
top most 5 normal sequences            5 exploited sequences
I_? = 322     D_2 = 0.031078           I_? = 90      D_? = 0.311688
I_? = 4       D_? = 0.003537           I_? = 70      D_? = 0.297030
I_? = 4       D_5 = 0.003537           I_? = 24      D_? = 0.291667
I_? = 4       D_4 = 0.002093           I_? = 1       D_? = 0.001681
I_? = 1       D_? = 0.001681           I_? = 1       D_? = 0.001681

B. System call sequences

We used the data sets published by the University of
New Mexico to evaluate the ability of our algorithm in
intrusion detection. The published data sets are system
call traces generated using strace.

We applied our algorithm to the xlock and named data
sets. Figure 5 shows the size of those data sets and some
sample entries from the traces. A trace entry contains
two numbers: the left one is the process id, the right one
is the system call number. A trace in the UNM data set
contains many processes.

We use our algorithm to identify exploited processes.
To achieve this, we split the original traces into system
call sequences according to process id. The entries in each
resulting sequence contain only the system call number. For
xlock, we randomly selected 61 process sequences for
training, then compared $I$ and $D$ of the other 12 sequences
(10 normal, 2 exploited); for named, we chose 22 normal
sequences for training. The result is listed in Table V.

In the xlock result, the information densities ($D$) of
the two exploited sequences are more than 2 times larger
than the largest density in the normal set. In the named
result, our algorithm identified 3 strange sequences, whose
information densities are at a different order of magnitude.
The last 2 sequences generate only 1 symbol ($I = 1$), which
indicates that the same sequences have appeared in the
training set at least once. After checking, we found that
those 2 processes are the parent processes used to set up
daemons; neither of them is the target of the attacks.

C. Performance

Finally, we list the throughput of our algorithm in Table
VI. It has been shown that SEQUITUR is a linear-time
algorithm. Our algorithm is similar to SEQUITUR except for
the read-ahead matching. Such matching (match
a long string against many shorter strings and find the
longest match) can be optimized using a prefix tree.

TABLE VI: Processing speed

data set     training                        evaluating
             time (s)   throughput (ent/s)   time (s)   throughput (ent/s)
lighttpd     90.4       36918.1              496.3      13422.2
memcached    10.1       85409.5              28.7       30817.4
xlock        237.1      1124.3               166.7      442.1
named        15742.6    585.4                189.9      89.2



V. CONCLUSION 

In this paper we propose a novel anomaly detection
algorithm that compares the increase in compressed data
length under grammar-based compression. To the best of our
knowledge, this is the first work that uses compression to
measure the strangeness of sequences in anomaly detection.
Unlike Markov-based algorithms, our method utilizes the
full knowledge about the structure of the data set. It can
be used to find high-level misbehavior as well as low-level
intrusions. We tested the algorithm on finding bugs in
fine-grained execution logs and on intrusion detection in
system call traces. On both kinds of data sets, our method
produced positive results. The proposed method is also
applicable to the text logs generated by today's cloud
computing systems.